ML Tutorials: Data Collection Techniques

Before beginning any Machine Learning project, you need data. In the age of the Internet of Things, large amounts of data can be collected from a wide variety of sources, including internal Data Warehouses, IoT devices (such as sensors or cameras), and APIs. As an ML Engineer, you need to understand how to collect data from each of these sources to give your models an adequate amount of training data. This tutorial will cover the following learning objectives:

  • Collecting Data from Data Warehouses
  • Collecting Data from APIs
  • Collecting Data from IoT Devices
  • Collecting Data from Data Lakes

Collecting Data from Data Warehouses


What is a Data Warehouse?


Why Do Data Scientists Need to Know SQL?


Summary

  • A Data Warehouse is a centralized store used by businesses to hold large amounts of structured data. It acts as a single source of truth, ensuring that data quality standards are maintained.
  • Structured Data refers to data that has a strict schema assigned to it. The most common sources of Structured Data include relational databases and tabular files such as CSVs and Excel spreadsheets.
  • Just like relational databases, Data Warehouses are subject-oriented. This means that each entity, known as a table, is granular to a single subject (e.g., customers, orders, employees).
  • What makes Data Warehouses so popular in Data Science is that they are Non-Volatile: once data arrives in the Data Warehouse, it isn't overwritten or deleted, making it a stable, reliable input for business intelligence, reporting, and machine learning.
  • Structured Query Language (SQL) is the language used to query databases and retrieve data in a usable format.
  • As an ML Engineer, SQL will be one of your most valuable skills, as you'll need to collect large amounts of data from Data Warehouses efficiently and correctly.
  • As a baseline, 95% of the time you'll be working exclusively with SELECT statements, so you should understand the syntax, semantics, and order of execution of these statements. You also need a general understanding of Primary and Foreign Keys, Joins, and Unions to gather data from multiple tables; see the sketch after this list.
  • TO GET STARTED WITH SQL, CHECK OUT OUR TUTORIALS: SQL Tutorials
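
Here's a minimal sketch of the kind of SELECT you'll run against a warehouse, using Python's built-in sqlite3 module as a stand-in for a real warehouse driver (e.g., a Snowflake or Redshift connector). The customers and orders tables and their columns are hypothetical.

    import sqlite3  # stand-in; real warehouses use dedicated driver libraries

    # Connect to the warehouse (here: a local SQLite file as a stand-in).
    conn = sqlite3.connect("warehouse.db")

    # A typical training-data pull: join two subject-oriented tables on a
    # primary/foreign key pair and aggregate to one row per customer.
    query = """
        SELECT c.customer_id,
               c.signup_date,
               SUM(o.order_total) AS lifetime_value
        FROM customers AS c
        JOIN orders AS o
          ON o.customer_id = c.customer_id  -- foreign key -> primary key
        GROUP BY c.customer_id, c.signup_date
    """

    rows = conn.execute(query).fetchall()
    conn.close()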

Collecting Data from APIs


What is an API?


What are HTTP Request Methods?


Summary

  • An Application Programming Interface (API) is simply a tool used to connect applications. An application can be thought of as a collection of code running on a server.
  • When sending data between applications, the data must have the same "shape", that is, match the definition declared on the backend. This shape is known as a schema and is used to explicitly define the required fields for each input record.
  • When using an external API (an API developed by an organization other than your own), you'll likely need an API Key. This acts as a login token granting you access to the API's backend, and it helps protect the API's servers from abuse such as Brute Force attacks.
  • API Documentation keeps track of the different types of records stored in the API backend, what they are used for, what permissions are required, and how to obtain an API Key. Before using any API, it's critical to read through the documentation to understand how to retrieve the correct information.
  • When a client communicates with a server, it needs to specify which action the server needs to take. This is done via HTTP Requests. There are several HTTP Request Methods:
    • GET: This is used to retrieve data from a source. This is the most common method used by ML Engineers during the Data Collection stage of the ML Lifecycle.
    • PUT: This is used to update data within a source. Software Engineers use it to keep the resources behind URL paths on a website, or data objects on a backend server, up to date.
    • POST: This is used to insert new data into a source. ML Engineers use it, for example, to add user entries to a training dataset; this happens at the end of the ML Lifecycle, once your model is deployed into Production.
    • DELETE: This is used to remove data from a source. Software Engineers use it to keep the data presented on websites or backend servers relevant and compliant with data security standards.
  • JavaScript Object Notation (JSON) is the format most commonly used by APIs to transport data between applications. The data is represented as key-value pairs, similar to a dictionary in Python; the sketch after this list shows a typical GET request and JSON response.
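
As a concrete example, here's a minimal sketch of collecting records from an external API with Python's requests library. The URL, the X-API-Key header name, and the limit parameter are all hypothetical; always check the API Documentation for the real endpoint, authentication scheme, and schema.

    import requests

    API_KEY = "your-api-key-here"               # obtained per the API's documentation
    URL = "https://api.example.com/v1/records"  # hypothetical endpoint

    # GET: retrieve data. The key is commonly sent as a header, though
    # some APIs expect it as a query parameter instead.
    response = requests.get(
        URL,
        headers={"X-API-Key": API_KEY},
        params={"limit": 100},                  # hypothetical pagination parameter
        timeout=10,
    )
    response.raise_for_status()                 # fail loudly on 4xx/5xx errors

    # The JSON body parses into Python dicts and lists (key-value pairs).
    records = response.json()                   # assuming the endpoint returns a list
    for record in records[:3]:
        print(record)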

Collecting Data from IoT Devices


What is the Internet of Things (IoT)?


What is Data Streaming?


Summary

  • The Internet of Things (IoT) is the connection and collaboration of internet-based devices to create "smart" appliances. An IoT device is simply any piece of technology that can access the internet, such as televisions, Bluetooth speakers, smart thermometers, and security cameras.
  • IoT devices can be separated into two categories:
    • General Devices: These are generic pieces of technology, typically connected via Wi-Fi or Bluetooth. Examples: televisions, smart speakers, smart lightbulbs.
    • Sensing Devices: These are typically hardwired devices used to collect data in real time for predictive or descriptive analytics. They use data streams to send data in real time to mobile applications for user consumption.
  • Data Streaming is the continuous flow of data as it's generated from a source so that it can be used for real-time processing and analytics.
  • Batch Scheduling refers to the collection and processing of data in batches at a regularly scheduled interval. A good example of this is retail transactions. Rather than getting data on how much money a retail store made every hour, it can be collected and aggregated on a daily basis.
  • In the context of machine learning, Data Streaming is useful for making real-time predictions. For example, suppose you are tasked with building a model that predicts, in real time, whether a financial transaction is fraudulent. Rather than querying a data warehouse every minute or hour to analyze these records, it's far more useful to get results as close to real time as possible.
  • Data Streams capture events. In the context of Data Streaming, an Event is an action that occurs at a specific point in time: a user clicking a button on a website, a sensor being triggered by an algorithm, or a sale taking place online.
  • Producers are the data sources for data streams. This is where IoT comes into play: IoT devices can generate large amounts of data depending on their use cases. Producers are set up to write data to a log file to maintain data integrity; each record is then placed in a queue, similar to a line at a supermarket.
  • Consumers are the endpoints that producers send data to. This could be a staging area, such as a cloud storage bucket or data lake, or an analytics database designed to handle incoming data streams, such as Apache Druid.
  • NOTE: To help keep Data Streams efficient, message queues are typically processed in parallel, with similar events partitioned into separate queues. This keeps things organized and running smoothly. A minimal producer/consumer sketch follows this list.
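
To illustrate the producer/queue/consumer pattern, here's a minimal in-process sketch using Python's standard queue and threading modules. Real deployments use a streaming platform such as Apache Kafka; the event fields shown are hypothetical.

    import queue
    import threading
    import time

    events = queue.Queue()  # stand-in for a real message queue / stream topic

    def producer(sensor_id, n):
        """Simulates an IoT sensor emitting one event per reading."""
        for i in range(n):
            events.put({"sensor": sensor_id, "reading": i, "ts": time.time()})
        events.put(None)  # sentinel: tells the consumer there are no more events

    def consumer():
        """Drains the queue, e.g., forwarding events to a staging area."""
        while True:
            event = events.get()
            if event is None:
                break
            print("consumed:", event)

    t = threading.Thread(target=consumer)
    t.start()
    producer("thermostat-1", 5)
    t.join()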

Collecting Data from Data Lakes

What is a Data Lake?


Summary

  • A Data Lake is a centralized repository for all raw, unstructured data within an organization. These are commonly used to store images, binary files, log files, and other types of data that can't be stored in a traditional relational database architecture.
  • Nowadays, the most common form of Data Lake is the Cloud Storage Bucket. Buckets let you store very large amounts of data (think petabyte scale) in a single location, or across multiple buckets that sort objects into groups, without having to fit a strict schema.
  • Rather than using SQL or a similar query language to retrieve files from a Data Lake, cloud providers offer APIs and Python libraries that make it easy to access objects stored within a specific bucket; a minimal sketch follows below.
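
For example, here's a minimal sketch of pulling raw files from an Amazon S3 bucket with the boto3 library (other cloud providers offer analogous SDKs). The bucket name, prefix, and object key are hypothetical, and AWS credentials are assumed to be configured in your environment.

    import boto3  # pip install boto3

    s3 = boto3.client("s3")
    BUCKET = "my-company-data-lake"  # hypothetical bucket name

    # List raw log files under a prefix instead of querying with SQL.
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/logs/")
    for obj in listing.get("Contents", []):
        print(obj["Key"], obj["Size"])

    # Download a single object to local disk for training.
    s3.download_file(BUCKET, "raw/logs/2024-01-01.log", "2024-01-01.log")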